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1  Introduction 

Finding  a  set  of  representative  vectors  for  clouds  of 
multi-dimensional  data  is  an  important  issue  in  data 
compression  [9],  signal  coding  [9,  8],  pattern  classifica¬ 
tion  [5]  and  function  approximation  tasks  [22,  24].  These 
centers  are  said  to  be  representative  in  the  following 
sense;  given  a  set  of  points  in  D-dimensional  Eu¬ 
clidean  space,  X  =  {i,, i  =  l,...Ar},  a  set  of  M  cen¬ 
ters  {mi, , mu),  m*  e  R® ,  ib  =  1, . . . ,  M,  partitions 
X  into  M  sets  {Si, . .  .,Sm]  ,  where  each  set  5*  corre¬ 
sponds  to  those  points  in  X  mside  the  Voronoi  polytope 
[8]  of  center  m*: 


Sk  =  {»•  :  Iki  -  mtll  <  ||xi  -  mj||,j  ^  *} 

where  ||  ■  ||  is  the  Euclidean  norm.  The  set  of  centers  is 
optimal  with  respect  to  this  norm,  if  they  minimise  the 
error  measure: 

(i) 

^  l:=lr.eSk 
^  i=l  t=l 

where  l5(-)  is  the  indicator  function  of  set  S. 

A  widely  used  technique  for  finding  a  locally  optimal 
set  of  M  centers  is  the  k-means  algorithm  [18]:  one  starts 
with  a  random  center  configuration  which  is  then 
updated  using  the  rule: 


where 


m 


0+1)  _ 
'*  — 


(2) 


Sk^  =  {*.•  :  ll*i  -  ^11  <  ll»«-  -  *} 

It  has  been  shown  [1,  17]  that  this  algorithm  in  fact 
converges  to  a  local  minimum  of  (2).  For  our  purposes, 
however,  it  is  more  convenient  to  use  a  different  form  of 
this  algorithm,  in  which  each  data  point  is  used  in  turn 
to  update  the  corresponding  center  location  [15].  If  at 
time  t  point  z,-  is  selected,  we  put: 


if  i  €  5j[‘^  (3) 

=  otherwise 

where  {at}  is  a  non-increasing  sequence  of  scalars.  This 
scheme,  which  we  will  call  the  Local  K-means  Algorithm 
(LKMA)  has  the  obvious  advantage  of  being  able  to  op¬ 
erate  in  a  dynamic  environment,  where  data  are  contin¬ 
uously  arriving.  It  also  allows  for  generalizations  that 
produce  a  “self-organizing  behavior  of  the  centers  (see 
below).  However,  if  the  spatial  distribution  of  the  data 
is  non-uniform  (i.e.,  if  the  data  are  clustered)  its  perfor¬ 
mance  may  be  poor.  In  section  2  we  present  an  extended 


version  of  this  scheme  which  is  more  amenable  to  formal 
analysis. 

A  straightforward  generalization  of  this  technique 
may  be  used  for  classification  purposes  [15]:  if 
one  is  given  for  each  data  point  Xi  an  associated  class 
C(zi)  6  {ci  ...,ct},  one  may  find  Mj  optimal  centers 
for  each  class  j  :  =  l,...,L,k  = 

using: 

=  mjy +- a,(i,  -  mj.y),  (4) 

if  z,  6  and  C(zi)  =  j 
=  otherwise 

where  is  the  set  of  data  points  inside  the  Voronoi 

polytope  of  mjy .  This  procedure  is  similar  to  the  LVQl 
classification  method  [15]  (except  that  in  that  case, 
one  introduces  a  repulsion  term  from  those  points  with 
^(^t)  ^  j  inside  Sjk  to  push  the  centers  away 

from  the  decision  boundaries)  and  is  equivalent  to  the 
use  of  update  rule  (3)  for  each  class  in  a  decoupled  fash¬ 
ion.  After  convergence,  a  specimen  y  will  be  assigned  to 
class  j  if  for  some  k  , 

111)  -  <  jjy  -  m,„|l  for  all  l,n 

Although  this  technique  will  work  in  general,  it  has 
a  relatively  low  efficiency,  since  most  of  the  centers  are 
irrelevant  for  the  classification  task:  only  those  that  are 
close  to  the  inter-class  boundaries  will  play  an  effective 
role.  In  fact,  we  wUI  show  how  to  build  more  efficient 
classifiers  by  finding  centers  that  are  representative  of 
the  set  of  class  boundaries  instead  of  the  set  of  class 
interior  points.  This  will  be  discussed  in  the  following 
section. 

Another  modification  of  the  basic  procedure  (3)  allows 
one  to  give  a  neighborhood  structure  to  the  set  of  centers. 
Thus,  Kohonen’s  Self  Organizing  Maps  [15]  show  how  a 
1  or  2-dimensional  lattice  structure  may  be  imposed  to 
the  set  of  centers,  and  how  this  structure  may,  in  many 
cases,  reflect  the  internal  organization  of  the  data.  If  the 
neighborhoods  of  each  center  mk  at  time  (  are  defined 
as  sets  of  centers  Nk{t)  that  satisfy: 

m5‘^  €  nI*'>  O  €  Nj(t) 

^  ^  Nk{i) 

we  get  the  following  modified  update  equations: 

-f  a,(zi  -  m^*^),  if  Zj  €  (5) 

=  +  at(xi  -  m^*  V(ll»T»t  ^ 

if  Xi  e  Sj*\j  ji  k,  and  mk  €  A/>(f) 

=  m^  \  otherwise 

where  /i(-)  is  a  decreasing  function.  In  Kohonen’s  work, 
the  neighborhoods  {Nk{t)}  are  initially  very  large  and 
shrink  slowly  to  their  final  desired  size  (e.g.,  a  nearest 
neighbor  structure). 


This  scheme  suffers  from  some  limitations:  in  the  first 
place,  it  is  difficult  to  analyze  (except  in  stniie  particular 
cases  [25]),  and  thus  to  understand  its  performance  in  a 
precise  way;  besides,  the  neighborhood  structure  is  im¬ 
posed  rather  than  found  from  the  data  (although  some 
modifications  have  been  proposed  to  this  end  [21]),  which 
limits  its  usefulness  in  unsupervised  clustering  tasks. 

The  above  considerations  provide  the  motivation  for 
the  present  work:  it  is  our  purpose  to  extend  the  LKM  A 
so  that  some  of  its  limitations  are  overcome;  specifically, 
we  will  propose  extended  versions  of  the  algorithm  that: 

i)  Allow  for  a  rigorous  analysis  of  its  convergence 
properties. 

ii)  Work  well  for  clustered  data. 

iii)  Will,  if  desired,  find  the  centers  of  the  inter-class 
boundary  set. 

iv)  Adapt  the  number  of  centers  to  the  local  spatial 
density  of  the  data. 

v)  Find  the  “natural”  neighborhood  structure  for  the 
ren*»rs  of  a  data  set  (i.e.,  may  be  used  for  unsu¬ 
pervised  clustering). 

The  plan  of  the  presentation  is  as  follows:  in  sec¬ 
tion  2,  we  introduce  a  family  of  algorithms  that  include 
the  LKMA  as  a  limiting  case,  and  give  a  general  con¬ 
vergence  theorem  for  this  class;  also  in  this  section,  we 
present  some  basic  extensions  and  generalizations  needed 
for  finding  the  centers  of  the  inter-class  boundary  set, 
and  for  adapting  the  number  of  centers  to  the  data  den¬ 
sity.  In  section  3  we  give  examples  of  the  application 
of  the  extended  scheme,  specifically,  to  image  segmenta¬ 
tion  and  pattern  classification,  and  finally,  in  section  4, 
we  present  some  conclusions  and  open  problems. 

2  Extended  Local  K-Means  Algorithms 

First,  we  will  introduce  a  generalization  of  the  LKMA 
that  will  help  us  to  understand  its  convergence  proper¬ 
ties.  Consider  again  a  set  X  =  {xi, . . .  ,a:/v}  of  points 
in  ,  a  set  of  centers  {mi , . . . ,  mu  } ,  and  the  weighted 
error  measure: 

^^(»^)  =  4  53  II*.  -  '"fciPw'ft  (6) 

i  =  l  4  =  1 

where 

S  _  exp[-/?||x.-  -  mtlp] 

‘‘  Ef=i«p[-/9||xi-m,|P] 

It  is  clear  that  the  error  measure  (2)  may  be  obtained 
as  the  limit  of  (6)  as  /?  — ►  oo.  Now,  £0  is  differentiable, 
and  its  gradient  with  respect  to  is  given  by: 

N 

Vt£0  =  5;(m*  -  Xi)Wf,  (8) 

•  =1 

where 

’"oil**  “  ”*ill^  “  II**  “  ”»ill*)) 


A  standard  gradient  descent  procedure  for  nuiuniizing 
£g  would  therefore  take  the  form: 

*  -  oi  (9) 

1=1 

for  some  sequence  {ut}  converging  to  zero.  However,  it 
is  often  more  convenient  to  adopt  a  stochasUc  gradient 
descent  algorithm  for  minimizing  £g,  that  is  equivalent 
to  approximating,  at  each  step,  the  sum  on  the  right  side 
of  eq.  (9)  with  just  one  term,  randomly  drawn  among 
the  N  terms.  In  formula: 

~  ^(.W^k  (^0) 

where  {(t }  is  a  sequence  of  random  variables  which  take 
values  on  { 1, 2, . . . ,  IV}.  This  minimization  technique  is 
especially  convenient  when  the  data  points  x,  come  one 
at  a  time,  and  it  has  been  extensively  used  by  the  neural 
network  community,  as  a  part  of  the  so  called  “back- 
propagation”  procedure  for  neural  networks  training.  In 
the  limit  as  /?  — ►  oc,  (10)  becomes  precisely  the  LKM.\ 
(3).  The  convergence  of  (10)  to  a  local  minimum  of  Eg 
follows  from  the  following  lemma  (the  proof  is  given  in 
the  appendix;  see  also  [28]  for  closely  related  results): 

Lemma  2.1  Let  F{y)  :  >-*  IL  be  of  the  form: 

1  ^ 

F{y)  =  fo(y)  +  jYlf'(y^ 

^  1=1 

where  ft  are  differentiable  functions  whose  gradient  is 
bounded  and  satisfies  the  following  Lipschiz  condition: 

\Ffi(y)  -  V/.(y')ll  <  M|iy  -  y'll  i  =  0, . . . ,  N 

for  some  positive  number  M.  Let  {j/n}  be  the  sequence 

Vn+l  =yr.  -an(V/o(yn)  +  V/{,(y„))-l-0„7„  (11) 

where  {(„}  is  a  sequence  of  random  variables  that  take 
values  in  {!,..., W)  with  uniform  probability  distribu- 
tion,  {un}  is  a  sequence  of  scalars  satisfying: 

OO  OO 

^  o„  =  oo  and  ^  <  oo 

t=i  .=1 

and  {r}„}  is  a  sequence  of  random  variables  that  is 
bounded  and  converges  to  zero  with  probability  one.  As¬ 
sume  that  {yn}  is  bounded  and  that  S  is  a  locally  asymp¬ 
totically  stable  point  of  the  ordinary  differential  equation: 

y  =  -VF  (12) 

with  domain  of  attraction  A5.  Then,  if  yn  £  G  for  all 
n,  for  some  compact  domain  G  C  As,  {yn}  converges  to 
S  with  probability  one. 

To  satisfy  the  conditions  of  the  lemma,  the  sequence 
{ot}  in  (10)  must  be  chosen  appropriately.  One  possible 
choice  is,  for  example. 


2 


at  =  1/t  . 


2.1  Soft  Winner-Takes-All  K-Means 

Algorithm  (10)  may  be  understood  as  a  “soft”  version  of 
the  WTA  scheme  that  implements  (2);  if  /?  is  relatively 
small,  when  a  new  data  point  arrives,  it  will  update  the 
position  of  not  only  the  closest  center,  but  of  others  that 
are  close  by  as  well. 

A  similar  result  may  be  obtained  by  using,  instead  of 
(6),  an  information-related  distorsion  measure  [2]: 

.  yv  r 

f^(m)  = -— J^log  J^expMIlii -mtip]  (13) 

i  =  l  U=] 

This  error  measure  also  corresponds  to  the  log  likelihood 
function  of  a  Gaussizm  mixture  model  for  the  data  dis¬ 
tribution  [10]  [23],  where;  the  means  of  the  Gaussians 
correspond  to  the  center  locations;  the  proportions  are 
all  equal  to  l/N,  and  all  the  covariance  matrices  are 
equal  to  /?”*/. 

In  this  case,  the  corresponding  stochastic  gradient  de¬ 
scent  equation  takes  the  form; 

^  (14) 

where  wf^  is  given  by  (7).  In  the  limit  as  ►  oo  both 
(10)  and  (14)  give  the  same  WTA  update  equation;  their 
experimental  behavior  is  also  very  similar. 

These  schemes  are  also  related  to  the  self-organizing 
maps  (5),  except  that  in  this  case  the  neighborhood  of 
each  center  is  not  predetermined,  but  rather,  it  varies 
dynamically  as  the  iterations  proceed:  the  relative  val¬ 
ues  of  the  weights  for  each  i  are  related  to  the 

instantaneous  neighborhood  structure  of  the  centers;  we 
may  define  the  instantaneous  link  status  of  centers  j  and 
k  as: 

1=1 

and  the  “natural”  neighborhood  structure  for  a  given 
data  set  by  the  average  /,*  of  /j*  over  all  i:  two  centers 
{j,k)  may  be  considered  neighbors  if  /,*  >  ff,  for  some 
appropriate  threshold  0  (a  similar,  although  computa¬ 
tionally  more  expensive  scheme  may  be  found  in  [21]). 
Figure  1  shows  this  natural  neighborhood  structure  for 
several  two-dimensional  data  sets.  As  one  can  see,  it 
represents  adequately  the  inner  structure  of  the  data, 
and  so,  it  may  be  used  for  unsupervised  clustering  tasks. 


collapse,  they  will  remain  in  that  state  regardless  of  an 
increased  $  (if  they  are  updated  in  parallel).  This  may 
be  accomplished  by  adding  a  /^-dependent  random  com¬ 
ponent  to  the  update  equation; 

+  at{xi  -  +  ^r(t),  (15) 

if.GSi‘> 

=  otherwise 

where  r(t)  is  a  uniformly  distributed  unit  D-vector  and 
(  is  a  small  number  (the  convergence  of  this  modified 
scheme  also  follows  from  lemma  1).  The  relative  per¬ 
formance  of  this  scheme,  compared  with  the  standard 
LKMA  is  illustrated  in  figure  2. 

- Figure  2  around  here _ 

2.2  State-Augmented  LKMA 

Another  way  of  improving  the  performance  of  the  stan¬ 
dard  LKMA,  is  to  include  in  the  state  of  each  proces¬ 
sor  (center)  statistics  about  its  past  dynamic  behavior. 
Specifically,  one  may  keep  track  of  the  number  of  times 
it  has  “won”  over  the  other  processors; 

t 

^*(0  = 

T=1  * 

where  t(f)  is  the  index  of  the  data  point  selected  at  time 
t,  and  (mk,hk)  is  the  augmented  state  of  processor  k. 

For  multi-class  data,  one  may  use  specific  centers  for 
each  class,  and  include,  for  each  one  of  them,  another 
state  variable  h  that  counts  the  number  of  times  the 
class-specific  processor  has  “won”  with  a  data  point  be¬ 
longing  to  its  class.  Thus,  if  mjk  is  the  k***  center  for 
class  j,  one  has: 


^ft(0  —  y  ^c(’’)(^i(t))  ■  ^(C^(®i{T))  ~  j)  (If) 

T=i  ’*■ 

where  C7(z,')  is  the  class  to  which  Xi  belongs  and  S(  )  is 
the  standard  Kroneker  delta  function. 

We  will  now  indicate  how  to  modify  the  update  rule, 
so  that  this  information  is  taken  into  account. 


_ Figure  1  around  here _ 

This  soft- WTA  algorithm  may  also  be  useful  for  im¬ 
proving  the  performance  of  the  LKMA  with  clustered 
data.  To  this  end,  one  may  use  a  time-varying  P,  rather 
than  a  constant  one:  starting  with  a  relatively  small 
value  ensures  that  every  center  will  be  attracted  to  some 
data  cluster,  regardless  of  its  (random)  initial  position; 
the  value  of  may  then  be  increased  to  obtain  a  fi¬ 
nal  configuration  that  minimizes  the  non-weighted  cost 
function  (2).  One  has  to  be  careful,  however,  when  work¬ 
ing  with  small  values  of  so  thi^  the  center  positions 
are  kept  different  from  each  other,  since  if  two  centers 


2.2.1  Adaptive  Number  of  Centers 

This  augmented  state  information  may  be  used  for 
adapting  the  number  of  centers  to  the  local  density  of 
the  data  points.  This  may  be  accomplished  by  updating 
the  center  configuration  after  each  full  sweep  over  the 
data  set  (or  after  a  sufficiently  large  number  of  data  has 
come  in),  supressing  those  centers  that  have  won  very 
few  times,  and  splitting  those  that  have  won  too  many 
times.  Specifically: 

For  each  center  k: 


if  hk  <  BiN,  supress  the  center.  (18) 


else  if  ht  >  6,N ,  generate  a  new  center  at  a  loca¬ 
tion  corresponding  to  one  the  data  points  inside  the 
current  Voronoi  polytope  of  center  k. 

where  Oi,d,  €  [0,1]  are  two  suitably  chosen  thresholds 
such  that  (tf,  —  Si)yV  >  1.  Note  that  this  choice  for 
the  location  of  the  new  centers  is  necessary  to  ensure 
convergence  (see  below);  in  practice,  however,  one  may 
simply  locate  them  at  m*  +  er  where  r  is  a  random  unit 
vector,  and  e  a  positive  number  small  enough,  so  that 
the  new  center  is  attracted  by  at  least  one  data  point 
inside  5* . 

If  the  total  number  of  centers  change  after  a  sweep, 
one  should  reset  /»*  to  zero  for  all  k,  and  at  to  at-N,  and 
effect  a  new  sweep  until  the  number  of  centers  stabilize. 

It  is  not  difficult  to  show  the  convergence  of  (18)  when 
the  lower  threshold  6i  is  set  to  0:  in  this  case,  one  starts 
with  one  processor  (center)  and  successively  generate 
new  ones  until  the  Voronoi  polytopes  corresponding  to 
all  centers  contain  less  than  6sN  data  points.  Since  in 
this  case  the  number  of  centers  increases  monotonically 
and  it  is  bounded  above,  (e.g.,  by  the  total  number  of 
data  points),  it  will  necessarily  converge  to  a  fixed  num¬ 
ber  M*. 

0,  is  a  free  parameter  that  controls  the  expected  av¬ 
erage  number  of  points  per  center,  while  6i  controls  the 
variance  of  this  number  (a  small  variance  is  obtained  if 
{Os  —  Oi)  is  small).  The  fact  that  one  has  control  over 
the  variance  means  that  one  can  generate  more  uniform 
center  distributions  with  this  method  than  with  the  stan¬ 
dard  k-means  scheme.  This,  in  turn,  will  usually  improve 
significantly  the  performance  of  other  procedures  that 
may  use  these  centers,  for  example,  for  vector  quantiza¬ 
tion  or  for  function  approximation  (see  section  3). 

In  practice,  it  is  convenient  to  set  the  lower  thresh¬ 
old  to  a  positive  value  to  prevent  centers  to  be  attracted 
to  single  outlier  data  points,  as  well  as  the  existence  of 
centers  with  empty  Voronoi  polytopes  (which  may  hap¬ 
pen  due  to  random  initialization,  if  one  starts  with  more 
than  one  center).  Note,  however  that  convergence  can¬ 
not  be  guaranteed  in  this  case;  consider  the  following 
example:  suppose  we  have  N  =  7  data  points;  NO,  =  5 
and  NOi  =4.  It  is  clear  that  the  number  of  centers 
generated  by  (18)  will  always  oscillate  between  1  and  2. 

This  kind  of  pathological  situations  are  unlikely  to 
occur  in  practice  though,  specially  if  {0,  —  Oi)  is  large 
enough  (say,  if  >  30i).  A  practical  way  of  ensuring 
convergence  in  any  case  is  to  let  0i  go  to  zero  after  a 
fixed  number  of  iterations. 

2.2.2  Boundary  Finders 

In  the  case  of  multi-class  data,  the  augmented  state 
may  be  used  to  find  the  inter-class  boundaries  directly 
from  sparse  data. 

Let  us  assume  that  we  have  class-specific  centers  for 
classes  1  through  M  —  1.  Consider  the  2-cIass  case  first: 
the  augmented  state  will  contain,  for  each  center  k,  the 
vector  (m*,At,Ait)  (we  only  have  one  type  of  centers 
in  this  case).  The  idea  is  to  constrain  the  update  rule 
so  that  a  center  position  is  updated  approximately  the 
same  number  of  times  by  data  points  belonging  to  each 


class,  so  that  at  all  t, 


"U  ~  "2k  ~  2“* 


The  boundary-finding  update  rule  may  thus  take  the 
form; 


+  a,{xi  -  (19) 

if  I,  6  and  A,y  <  ^A^  ^ 

=  ^  ,  otherwise 

where  Xi  is  the  data  point  chosen  at  time  t. 

It  is  clear  that  with  this  rule  we  will  have,  at  any  time 

t.|Al*>-2A‘V  (<  1,  so  that  there  will  be  approximately 
the  same  number  of  data  points  belonging  to  each  class 
inside  the  Voronoi  poly  tope  of  each  center,  provided  that 
the  data  density  is  uniform.  In  this  case,  upon  conver¬ 
gence,  every  center  k  will  be  located  at  about  the  mid¬ 
point  of  the  centroids  of  the  sets  {Co  H  S*  }  and  {Ci  n  S*  ) 
where  C„  is  the  set  of  data  points  of  class  n. 

To  see  why  this  is  true,  note  that  when  the  update 
rule  (20)  reaches  its  steady  state,  we  must  have  tl.  ' 

2 

-  m/b)]  =  H  ~  c)  =  0 

c=i  x.eSk 

where  Pt(t,  c)  is  the  probability  of  selecting  an  example 
t  that  is  in  Sk  and  belongs  to  class  c  and  £[■]  denotes 
the  expected  value.  Now, 


Pk(t,  c) 


Pr(select  t  1 1  €  Cc  n  5t)  Pr(select  C,)  « 


1  1 
I  Cc  n  St  r  2 


(20) 


where  |  Ce  DSt  |  denotes  the  number  of  points  of  class 
c  inside  St,  so  that 
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It  is  in  this  sense  that  one  may  say  that  the  centers 
are  representative  samples  of  the  inter-class  boundary 
set. 

This  procedure  may  be  generalized  to  Q  >  2  classes, 
by  samphng,  for  each  class  {1,...,Q— 1},  the  boundary 
between  itself  and  all  the  other  classes.  This  however,  is 
not  very  efficient,  since  many  parts  of  the  boundary  will 
be  sampled  several  times.  A  more  economical  sampling 
may  be  obtained  by  defining  the  sets: 


To*=C* 


T*  =Ct+,  UCi+2U...CQ 

for  A  =  1, ...  Q  —  1,  and  finding,  for  each  k,  the  centers 
{mti }  that  sample  the  boundary  between  the 
sets  Tq  and  Tf  using  algorithm  (20). 

It  is  of  course  possible  (and  desirable)  to  combine  this 
procedure  with  the  one  for  finding  the  number  of  centers 
in  an  adaptive  way.  Figure  3  shows  the  performance  of 


this  combined  scheme  for  binary-class  data  in  2  dimen¬ 
sions.  Other  examples  of  applications  of  this  algorithm 
are  presented  in  section  3. 

_ Figure  3  around  here _ 


2.3  Arbitrary  Neighborhood  Structures  and 
Smoothness  Constraints 

We  presented  in  section  1  (equation  (5))  Kohonen’s  mod¬ 
ification  to  the  LKMA,  which  produces  “self-organizing 
maps”  between  the  data  points  and  the  renters,  that 
is,  center  configurations  with  a  prescribed  neighborhood 
structure.  These  maps  are  interesting  because  they  pro¬ 
vide  a  model  for  the  topology-preserving  mappings  that 
are  known  to  exist  between  sensory  inputs  and  the  brain 
cortex  (6,  14]. 

Kohonen’s  algorithm,  however,  presents  a  number  of 
problems:  firstly,  it  is  difficult  to  analyze  (its  convergence 
and  the  properties  of  its  fixed  points  have  not  been  es¬ 
tablished  in  the  general  case);  secondly,  its  experimental 
rate  of  convergence  is  usually  very  slow,  and  finally,  it  is 
not  clear  how  to  extend  it  to  include  other  desired  prop¬ 
erties  of  the  center  configurations  (i.e.,  the  requirement 
that  the  centers  lie  in  a  smooth  curve,  etc.). 

In  this  section  we  will  present  an  alternative  algo¬ 
rithm  for  producing  organized  center  structures  which 
can  be  derived  from  a  Bayesian  formulation  of  the  prob¬ 
lem,  with  a  Gibbsian  prior  for  the  center  configurations. 
This  approach  not  only  permits  a  cleaner  analysis  of  the 
algorithm,  but  also  exhibits  faster  convergence  behavior, 
and  can  be  easily  generalized  to  include  other  properties 
(e.g.,  smoothness)  of  the  configurations. 

The  basic  idea  in  this  approach  is  to  express  the  prior 
constraint  on  the  organization  of  the  centers  in  prob¬ 
abilistic  terms,  specifically,  in  the  form  of  a  discrete 
Markov  Random  Field  (MRF)  model  [20,  13,  3,  26,  7],  in 
which  the  center  locations  correspond  to  the  state  vari¬ 
ables  associated  with  the  nodes  of  a  graph  whose  topol¬ 
ogy  is  related  — but  not  necessarily  identical —  to  the  de¬ 
sired  neighborhood  structure  of  the  centers;  in  fact  this 
structure,  as  well  as  the  smoothness  of  the  locations  of 
the  center  configurations  will  depend  both  on  the  graph 
topology  and  on  the  particular  choice  of  the  potential 
functions  that  define  the  model. 

The  prior  probability  distribution  of  the  center  con¬ 
figuration  will  thus  be  of  the  form: 

T(m)  = 

where  Z  is  a  normalizing  constant  and  the  energy  U  is 
of  the  form: 

f/(m)  =  5;Vc(m) 
c 

where  the  sum  is  taken  over  the  chques  {C}  of  the  cor¬ 
responding  neighborhood  system  and  Vc{  )  are  potential 
functions  that  depend  only  on  the  values  of  the  state 
variables  associated  with  the  nodes  contained  in  each 
clique. 

The  optimal  center  configuration  may  now  be  defined 
as  the  most  likely  one  (given  the  prior  MRF  structure) 


that  minimizes  the  error  criterion  (6),  i.e.,  as  the  Max¬ 
imum  a  Posteriori  (MAP)  estimator  of  a  Gibbsian  field 
with  posterior  energy  given  by: 

llp(m)= 

i  k  c 

where  A  is  a  positive,  real  parameter. 

If  the  potential  functions  are  differentiable,  the  fol¬ 
lowing  update  rule  will  converge  to  a  local  minimum  of 
(21): 
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Ckec 


(22) 

for  an  appropriate  choice  of  the  sequence  {at}  (see 
lemma  1). 

Note  that  in  the  particular  case  of  cliques  of  size  2, 
and  quadratic  potentials  of  the  form: 


Vjk{m)  =  ||mj  -  mtip 


the  posterior  energy  corresponds  to  the  composite  cost 
function  discussed  in  [2],  and  the  corresponding  update 
equation,  for  first  order  (nearest-neighbor)  systems,  re¬ 
duces  to  the  algorithm  proposed  by  Durbin  and  Mitchi- 
son  [6]  for  the  development  of  cortical  mtq>s. 

Self-organization  of  the  centers  in  rectangular  lattices 
may  be  obtained  using  second  order  neighborhoods  and 
cliques  of  size  3  that  correspond  to  triads  of  neighboring 
sites  that  lie  in  the  same  row  (or  column)  of  the  lattice, 
so  that  3  centers  belonging  to  the  same  clique  will  not 
contribute  to  the  energy  if  they  lie  in  a  straight  line. 

If  the  lattice  size  is  relatively  large,  however  (greater 
than  about  8x8),  the  system  (22)  with  this  type  of 
potentials  will  very  often  converge  to  local  minima  tor 
which  the  lattice  appears  “folded”  in  some  way,  so  that 
the  global  order  is  not  properly  established  (see  figure  4). 
This  can  be  remedied  in  two  ways:  first,  it  is  necessary 
to  include  potentials  that  assign  high  energies  to  folded 
configurations;  a  simple  choice  is  to  assign  to  cliques  of 
4  sites  {i,j,k,l),  that  lie  in  the  corners  of  unit  squares 
in  the  lattice,  potentials  of  the  form: 

i[l|m<  -  m^ll*  -  Ijm*  -  m,||=*]’ 


where  {i,j),{j,k),{k,l)  and  (/,  i)  are  nearest  neighbors. 
Secondly,  it  is  necessary  to  define  a  neighborhood  system 
that  spans  a  wide  range  of  scales,  with  potentials  that 
increase  their  relative  weights  with  the  scale  size.  Thus, 
for  example,  if  is  the  center  at  row  i  and  column  j 
of  the  lattice,  the  multi-scale  potentials  that  correspond 
to  column  alignment  may  take  the  form: 


=  Atll  -  m,j_t  +  2m, j  -  mij+*|p  (23) 
with  At+i  >  At,  for  fc  =  1, . .. ,  A. 


_ Figure  4  around  here _ 

A  more  efficient  way  of  achieving  the  same  result  is  to 
define  a  “pyramid”  of  processes  that  operate  from  coarse 
to  fine  scales,  and  that  increase  the  number  of  centers  at 
each  refining  step:  one  may  start  with  a  3  x  3  lattice  , 
which  after  a  few  iterations  of  (22)  may  be  refined  by 
adding  intermediate  centers  whose  initial  positions  cor¬ 
respond  to  the  midpoints  of  existing  ones  (see  figure  5), 
and  repeat  the  whole  procedure  recursively  until  the  de¬ 
sired  number  of  centers  is  obtained.  Note  that  with  this 
procedure,  the  potentials  are  always  of  the  form  (23) 
with  ib  =  1. 


_ Figure  5  around  here _ 

The  final  configurations  obtained  in  this  way  (see  fig¬ 
ure  4-b)  are  very  similar  to  those  obtained  with  Koho- 
nen’s  algorithm  [15]  (which  also  incorporates  long  range 
interactions),  but  since  the  neighborhood  size  remains 
fixed,  the  computational  complexity  is  lower,  and  since 
the  new  centers  are  already  close  to  their  correct  (glob¬ 
ally  ordered)  positions,  the  convergence  rate  is  signifi¬ 
cantly  faster. 

It  is  convenient,  as  in  the  case  of  Kohonen’s  scheme, 
to  mantain  a  fixed,  relatively  large  at  in  (22)  until  the 
final  number  of  centers  is  reached.  At  this  point,  the  use 
of  an  appropriately  decreasing  sequence  guarantees  the 
final  convergence  to  a  local  minimum  of  (21)  (see  lemma 
1).  Other  examples  of  the  use  of  this  approach  will  be 
given  in  the  next  section. 

3  Applications 

In  this  section  we  present  some  applications  that  illus¬ 
trate  the  power  of  the  techniques  we  have  developed 

3.1  Edge  Detection  from  Sparse  Data 

Suppose  we  have  a  set  of  sparse  data  points  X  = 
{xi .,Xff}  inside  the  unit  square  Q  in  R^,  which  may 
belong  to  either  one  of  two  classes  (0, 1},  and  suppose 
there  is  a  closed  region  A  C  whose  boundary  is  a 
closed,  smooth  curve,  and  such  that 

C(z)  =  1  O  z  €  i4 

(i.e.,  all  the  data  points  in  class  1  are  inside  A).  The 
problem  now  is  to  find  a  polygonal  line  (i.e.,  a  sequence 
of  points  {mi, . . .  mjif })  that  lies  close  to  the  smooth 
curve  that  defines  the  boundary  of  A. 

This  may  be  achieved  by  combining  the  adaptive 
boundary-finding  scheme  of  section  2.2.2  with  a  prior 
MRF  constraint  on  the  configuration  of  centers  that  cor¬ 
responds  to  a  circular  lattice  (i.e.,  a  closed  polygonal 
line).  In  particular,  to  every  clique  of  3  neighboring  sites 
(i,j,k)  we  associate  the  potential: 

Vijkim)  =  ^11  -  m.  +  2m,-  -  m*||^ 

(note  that  mi  and  mm  are  considered  neighbors  in  a 
circular  lattice). 

The  combined  update  rule  takes  the  form: 
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(24) 


where  z,  is  the  data  point  chosen  at  time  t. 

An  example  of  the  performance  of  this  algorithm  is 
shown  in  figure  6.  Note  that  since  in  the  final  config¬ 
uration  the  centers  are  ordered,  this  scheme  is  in  fact 
finding  the  (discrete)  boundary  curve  from  sparse  data 
without  interpolating  the  corresponding  surface.  In  this 
sense,  it  may  be  said  that  this  algorithm  finds  the  initial 
position,  the  number  of  knots  and  the  final  configura¬ 
tion  of  a  “snake”  [12]  that  approximates  the  inter-class 
boundary. 


Figure  6  around  here 


_ Figure  7  around  here - 

With  straightforward  modifications,  this  scheme  may 
be  used  for  finding:  multiple  closed  boundaries  (fig.  7-a 
and  7-c);  open  curves  that  go  from  one  border  of  the 
image  to  another  (fig.  7-b  and  7-d),  etc. 

3.2  Local  Gaussian  Classifiers 
Gaussian  Classifiers  [5]  are  a  well  known  class  of  pro¬ 
cedures  for  the  segmentation  of  multi-class,  multi¬ 
dimensional  data.  The  classification  procedure  for 
each  specimen  z  €  R"  involves  the  computation  of  Q 
quadratic  discriminant  functions  (one  for  each  class)  : 


Dk  =  (z-/ii)^Ej‘(z-/it)-|-log|Et  I 

forjfc=l...,Q  (25) 


where  fn  is  the  estimated  location  of  the  centroid  of 
class  it;  Et  =  [(rj*’]  is  the  estimated  (n  x  n)  covariance 
matrix  and  |  Et  |  is  its  determinant.  The  specimen  is 
then  assigned  to  the  class  with  the  lowest  value  of  Dt . 

The  “learning”  phase  consists  in  the  computation  of  fi 
and  E  from  a  set  of  examples  {zi, . . .  ,ZAf}  with  known 
classes  {ci,..  .,cjv}: 


Xr6(C{Xr)  -  k) 

E;Ll«(C(*r)-*) 


=  E^=l  XriXrj6iC(Xr)  -  k)  _ 

l:r=lHC(Xr)-k)  '  ^ 

so  that  it  takes  only  one  pass  through  the  data. 


(26) 

(27) 


The  performance  of  this  kind  of  classifiers  will  be  opti¬ 
mal,  of  course,  only  if  the  actual  distribution  of  the  data 
corresponds  to  a  set  of  Gaussians  centered  at  the  class 
centroids;  it  is  possible,  however,  to  extend  this  idea  for 
more  complex  cluster  shapes,  by  approximating  the  data 
distribution  for  each  class  with  a  sum  of  Gaussian  clouds. 
In  this  way,  the  classification  is  still  done  by  taking  the 
minimum  of  the  discriminant  functions  (25),  except  that 
now  k  will  be  a  2-dimensional  index:  k  =  (ki^k^)  rep¬ 
resenting  cluster  k2  of  class  ki,  and  the  minimization 
should  be  done  over  all  clusters  of  all  classes. 

Note  that  the  precise  location  of  the  clusters  is  not 
important,  as  long  as  the  decision  boundary  between  ad¬ 
jacent  clusters  of  different  classes  is  well  approximated 
by  a  quadratic  hypersurface;  in  many  cases,  this  will 
happen  provided  the  set  of  midpoints  between  the  cor¬ 
responding  centroids  samples  the  boundary  manifold  at 
a  sufficiently  high  rate. 

This  suggests  the  following  strategy  for  the  learning 
phase  of  the  compound  classifier: 

1  :  Find  a  distribution  of  centers  that  samples  the 

inter-class  boundary  manifold  using  the  boundary¬ 
finding  scheme  described  in  section  2.2.2; 

2  :  Use  the  data  points  inside  the  Voronoi  polytope 

of  each  center  to  learn  the  parameters  of  a  local 
Gaussian  classifier  using  (26)  and  (27)  (computing 
the  centroid  parameters  only  for  those  classes  that 
have  representative  points  inside  the  polytope). 

Note  that  the  update  rule  (20)  and  the  procedure 
for  the  adaptive  determination  of  the  number  of  centers 
guarantee  that  upon  convergence,  there  wUl  be  observa¬ 
tions  for  at  least  two  different  class  clusters  inside  every 
Voronoi  poly  tope. 

The  performance  of  this  procedure  is  exemplified  in 
figure  8,  which  illustrates  a  binary  classification  task  for 
2-dimensional  data. 

We  also  tested  this  technique  with  a  classification 
problem  in  5  dimensions.  Table  1  compares  its  perfor- 
meuice  with  other  known  classifiers  (this  table  is  included 
only  as  an  illustration;  the  experiment  was  not  intended 
to  perform  a  formal  comparison  test).  Note  that  since 
this  procedure  generates  the  same  number  of  centers  per 
class  (in  binary  classification  problems),  its  performance 
will  deteriorate  if  the  data  are  not  evenly  distributed 
among  classes.  In  this  case,  its  performance  may  be¬ 
come  worse  than  that  of  other  classifiers. 

The  training  data  set  consisted  of  1000  data  points 
in  the  unit  5  dimensional  cube  [0,1]^.  Of  these  points, 
I  were  of  class  1,  and  ^  of  class  2.  Half  of  the  points 
of  class  1  were  inside  a  5  dimensional  hypersphere  of 
radius  0.07  and  the  other  half  are  outside  a  5  dimensional 
hypersphere  of  radius  0.2.  The  points  of  class  2  were 
between  these  2  hyperspheres.  The  test  set  consisted  of 
10000  data  points  with  the  same  distribution. 

The  rows  labeled  “RBF  (adap)”,  “RBF  (soft)”  and 
“RBF  (fixed)  ”  correspond  to  the  classifiers  that  are  ob¬ 
tained  by  approximating  the  indicator  function  of  one 
of  the  classes  (say,  class  2)  with  a  linear  combination 
of  Gaussians  with  fixed  covariance  al  (in  all  cases  we 
used  <r  =  5.0),  and  centered  at  a  fixed  set  of  points  [22]. 


Table  1:  Comparison  of  local  Gaussian  (LGC)  and  neu¬ 
ral  network  (NN)  classifiers  on  a  5  dimensional  example. 


These  points  may  be  found  using:  the  k-means  algo¬ 
rithm  with  a  fixed  number  of  centers  (“RBF  (fixed)” 
rows)  or  the  adaptive  strategy  of  section  2.2.1  (“RBF 
(adap)”  rows).  As  one  can  see,  the  performance  is  sig¬ 
nificantly  improved  in  the  latter  case,  due  to  the  fact 
that  the  center  distribution  is  more  uniform.  A  similar 
improvement  has  been  reported  if  the  standard  k-means 
algorithm  is  replaced  by  the  “soft-WTA”  version  (14) 
[23].  The  performance  of  this  scheme  is  included  in  the 
rows  labeled  “RBF  (soft)”,  for  comparison. 

As  one  can  see,  the  local  Gaussian  classifier  has  the 
best  performance  on  the  test  set.  The  feedforward  neu¬ 
ral  network  with  50  hidden  units  and  the  RBF  network 
with  50  centers  have  approximately  the  same  number  of 
parameters  as  the  local  Gaussian  classifier  with  16  pro¬ 
cessors. 

In  order  to  test  the  LGC  on  a  set  of  real  data  we  con¬ 
sidered  the  same  task  of  gender  classification  that  has 
been  considered  by  Brunelli  and  Poggio  in  [4].  Brunelli 
and  Poggio  had  a  data  set  consisting  of  168  digitized  pic¬ 
tures  of  frontal  views  of  people  without  facial  hair,  and 
the  task  was  the  classification  of  the  gender.  There  were 
^i  male  and  21  female  subjects  in  the  data  set,  and  4 
pictures  per  subject.  Each  picture  was  represented  by 
a  lO-dimensional  vector  of  automatically  extracted  ge¬ 
ometrical  features,  and  we  were  provided  with  a  set  of 
168  such  vectors.  In  their  analysis  Brunelli  and  Poggio 
found  that  only  few  of  the  variables  were  relevant  for  the 
classification  task,  and  therefore  we  used  only  the  first 
8  entries  of  each  lb-dimensional  vector,  that  included 
the  most  relevant  variables  found  by  Brunelli  and  Pog¬ 
gio.  We  did  not  attempt  to  find  the  best  set  of  variables 
that  describes  this  task.  Since  the  number  of  training 
examples  is  small,  relative  to  the  dimensionality  of  the 
problem,  we  used  the  LGC  without  the  boundary  find¬ 
ing  scheme  for  locating  the  centers.  In  order  to  test  the 
performances  of  the  LGC  we  adopted  two  procedures: 

1.  the  data  set  was  randonnly  split  in  4  equal  subsets, 
3  of  which  were  used  for  training  and  1  for  test¬ 
ing.  This  was  repeated  10  times  and  the  average 
training  and  testing  errors  computed; 

2.  the  leave-one-out  procedure  described  by  Brunelli 


and  Poggio  in  [4]. 

In  both  cases  the  treuning  and  test  error  were  less  than 
hVi  with  3  centers  (3  Gaussians  per  class),  and  less  than 
8%  with  one  Gaussian  per  class. 

_ Figure  8  around  here _ 


3.3  Image  Segmeutatiou 

As  a  final  example,  we  consider  an  image  segmentation 
problem  that  arises  in  the  processing  of  certain  biomed¬ 
ical  images:  scintigraphic  images  [11,  19],  which  are  ob¬ 
tained  by  counting  the  number  of  radioactive  particles 
that  incide  on  each  cell  of  a  receptor  array.  The  goal  of 
the  processing  step  is  to  obtain  from  these  measurements 
an  estimate  of  the  radioisotope  distribution  in  specific 
organs  within  the  human  body. 

Particle  count  and  radioisotope  concentration  are  re¬ 
lated  by  the  Poisson  distribution  formula;  therefore,  the 
processing  step  consists  in  the  restoration  of  a  piecewise 
smooth  function  corrupted  by  Poisson  noise.  If  it  were 
possible  to  find  the  boundaries  of  the  organ  in  question 
(e.g.,  the  heart),  the  problem  would  reduce  tc  filtering  a 
smooth  function  within  a  given  domain,  for  which  effec¬ 
tive  methods  are  available  (for  example,  Bayesian  esti¬ 
mation  methods  with  MRF  priors  and  quadratic  poten¬ 
tials  to  model  the  smoothness  constraint  [20]).  In  the 
example  that  we  give  here,  we  show  that  it  is  possible 
to  adapt  the  methods  that  we  have  presented  to  classify 
the  pixels  of  a  scintigraphic  image  of  the  heart  in  such 
a  way  that  one  class  corresponds  approximately  to  the 
interior  of  the  organ. 

To  do  this,  we  will  use  the  following  concepts; 

We  assume  that  the  two  classes  ue  characterized 
solely  by  the  intensity  level  of  the  image,  i.e.,  the  in¬ 
terior  (class  1)  has  high  intensity  with  respect  to  the 
background  (class  2).  It  is  assumed  that  the  classes  are 
fuzzy  sets  [29]  with  membership  functions  of  the  form: 


1  -F  exp[-/?(z  -  ff)] 

4>2{2)  =  l-^i(^) 

where  /?  and  0  are  positive  parameters. 

The  formulae  for  the  parameters  of  the  discriminant 
functions  of  the  local  Gaussian  classifier  c  are  modified 
in  the  obvious  way: 


c  _  Er  Xr(l>k(iiXr)) 


(c.k)  _  Er  Xr.Xr>t(z(Xr)) 


(29) 

(30) 


where  the  sums  are  taken  over  the  learning  domain  of 
the  classifier;  Xr  denotes  the  coordinates  of  pixel  r  of 
the  image,  and  z(xr)  denotes  the  value  of  the  observed 
intensity. 

The  learning  domain  of  each  local  classifier  is  taken 
as  before,  as  the  Voronoi  polygon  of  a  center  that  sam¬ 
ples  the  image  in  an  appropriate  way.  In  this  particular 
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example,  we  are  interested  in  segmenting  tlie  left  ven- 
tricule  of  the  hearth  taken  from  a  left  anterior  oblique 
projection.  From  this  viewpoint,  the  ventricule  appears 
as  a  high  intensity  “donul"  over  a  dark  background  |see 
figure  y-a).  Therefore,  it  is  desirable  that  the  centers 
are  located  uniformly  along  a  closed,  smooth  curve  that 
is  attracted  towards  the  higher  intensity  region  of  the 
image  (note  that  the  quadratic  decision  surface  of  each 
local  classifier  may  be  an  hyperbola,  and  therefore,  it 
can  adequately  segment  a  region  that  looks  like  a  band 
within  its  domain). 

Since  a  scintigraphic  image  is  actually  representing 
particle  counts,  we  may  use  update  rule  (24)  directly, 
considering  that  at  each  location  Xr  there  are  c(Zr)  data 
points.  To  get  an  appropriate  behavior  for  this  update 
rule,  however,  it  is  necessary  that  all  the  data  points  are 
visited  in  a  random  order  (see  lemma  1)  To  obtain  this 
condition,  it  is  not  enough  to  visit  the  sites  of  the  lattice 
randomly;  it  is  also  necessary,  when,  each  site  i  is  vis¬ 
ited,  to  “flip  a  coin”,  and  only  update  the  corresponding 
center  location  with  probability  Pupdate  —  z{xt)/ :max< 
where  z^ar  =  maxi  z(xi). 


_ Figure  9  around  here _ 

The  results  of  this  procedure  applied  to  the  real  scinti¬ 
graphic  image  of  figure  9-a  are  shown  in  figure  9-b. 
The  white  squares  indicate  the  center  locations,  and  the 
white  line  the  final  compound  decision  boundary.  The 
threshold  0  vists  obtained  as  the  minimum  between  the 
two  largest  peaks  of  the  global  histogram  of  the  image;  in 
the  experiments  we  performed,  however,  we  found  that 
the  final  results  are  not  very  sensitive  to  the  precise  value 
of  neither  this  nor  the  other  parameter  (0). 

4  Conclusions 

In  this  paper  we  have  analyzed  the  local  K-Means  algo¬ 
rithm,  and  have  presented  some  extensions  that  increase 
its  range  of  applicability.  Our  main  contributions  are  the 
following; 

i)  We  have  established  sufficient  conditions  for  the 
convergence  of  this  algorithm  to  a  (local)  minimum 
of  a  quadratic  distorsion  measure  (lemma  1).  In 
doing  so,  we  showed  that  it  can  be  obtained  as  the 
limit  (as  the  parameter  0  becomes  large)  of  a  fam¬ 
ily  of  algorithms  which  are  closely  related  to  those 
obtained  by  minimizing  an  information  distorsion 
measure.  For  moderate  values  of  0,  we  showed 
that  these  algorithms  can  be  used  for  unsupervised 
learning  (clustering)  tasks,  since  they  can  find  a 
neighborhood  structure  for  the  centers  that  reflects 
the  structure  of  the  data. 

ii)  We  showed  that  by  varying  the  parameter  0  and 
adding  a  noise  term  to  the  update  equation,  it  is 
possible  to  improve  significantly  the  performance 
of  the  algorithm  for  clustered  data. 

iii)  We  introduced  a  modification  to  this  algorithm 
that  consists  in  augmenting  the  state  of  each  pro¬ 
cessor  (center)  so  that  it  keeps  track  of  its  own 
dynamic  behavior.  With  this  modification,  it  is 


possible  to  cor^rol  the  algorithm,  so  that,  for  ex¬ 
ample,  the  centers  sample  the  inter-class  bound¬ 
ary  manifold  (for  multi-class  data)  directly.  Since 
this  manifold  has  one  dimension  less  than  the  data 
themselves,  and  it  contains  the  most  relevant  loca¬ 
tions  for  classification  r  > ’-poses,  this  may  represent 
a  considerable  savi''<'  in  computational  resources. 
We  also  showed  thai  this  augmented  state  may  be 
used  for  other  purposes,  such  as  to  control  the  num¬ 
ber  of  centers  in  an  automatic  way. 

iv)  We  showed  that  the  self-organizing  property  of  cer¬ 
tain  variations  of  the  LKMA  (specifically,  Koho- 
nen’s  self-organizing  nets)  can  be  put  in  a  rigor¬ 
ous  framework  by  considering  the  procedure  as  a 
Bayesian  estimator  with  a  Gibbsian  (MRF)  prior. 
This  formulation  has  several  advantages:  firstly, 
the  algorithm  convergence  can  be  rigorously  estab¬ 
lished,  and  the  properties  of  the  stable  states  better 
understood;  secondly,  it  suggests  the  use  of  multi¬ 
scale  (pyramid)  strategies  that  accelerate  the  con¬ 
vergence  and  conduct  the  algorithm  to  better  fi¬ 
nal  configurations,  and  finally,  it  permits  the  easy 
extension  of  the  algorithm  to  generate  a  wide  va¬ 
riety  of  organized  configurations.  Thus,  we  pre¬ 
sented  an  application  that  consisted  in  finding  the 
smooth  curves  that  define  the  inter-class  bound¬ 
aries  in  2-<iimensional  segmentation  problems  from 
sparse  data. 

v)  We  presented  a  classification  scheme  that  con>- 
bines  the  power  of  the  Gaussian  classifiers  with 
the  boundary-sampling  properties  of  the  extended 
LKMA,  allowing  the  construction  of  very  complex 
decision  boundaries  with  very  few  computational 
elements.  We  exemplified  the  performance  of  these 
Local  Gaussian  Classifiers,  both  in  '‘classical”  clas¬ 
sification  problems  and  in  fuzzy  classification  tasks 
related  to  image  segmentation. 

Other  extensions  of  great  interest  are  related  to  the 
use  of  these  procedures  for  the  optimal  location  of  centers 
in  function  approximation  problems.  This  is  one  of  the 
subjects  of  our  current  research. 

A  Convergence  of  Stochastic  Gradient 
Descent 


AJ:  the  sequence  {j/„}  is  bounded  unth  probabi/ily  I 
A2:  /»(■,  )  IS  a  bounded  measurable  -valued  funr- 
tton: 

AS:  there  are  non-negattve  measurable  real-valued 
functions  0(  ),  g(-.  )  such  that  0(')  ts  nondecreasing  as 
its  argument  increases,  ff(u)  —  0  as  u  — >  U,  0(  )  and 
g(-,  )  are  bounded  on  bounded  sets  and 

l|/‘(y.O  -  ^(y'.^)ll  <  ^(lly  -  y'll)y{y.y')  ; 

Af:  there  ts  a  continuous  function  h(  )  such  tha'  for 
each  c  >  0  and  each  y: 

m 

lim  P{  sup  II  V  ai[h{y,^i)  -  h(y)]||  >  < }  =  0 
"•>" 

A5:  {a„)  ts  a  sequence  of  positive  real  numbers  such 
that  a„  — ►  0  as  n  — *  0,  and  <*•»  = 

Let  S  be  a  locally  asymptotically  stable  point  of  the  ordi¬ 
nary  differential  equation: 

y  =  My) 

with  domain  of  attraction  As-  Then,  if  yn  €  G  for  all 
n,  for  some  compact  domain  G  C  As,  (j/n)  converges  to 
S  with  probability  one. 

Lenruna  2.1  can  be  derived  from  theorem  (A.l)  set¬ 
ting  {fn}  to  be  a  sequence  of  random  variables,  with 
uniform  probability  distribution,  that  take  values  in 
{1,2,3, ...,N}  and  setting 

My.O  =  -V/o(y)-V/<(y)  ,  My)  = -VF(y)  . 
Under  these  conditions  we  just  need  to  check  the  validity 
of  assumptions  A3  and  A4  in  theorem  (A.l). 

A3:  If  we  assume  that  the  /,  are  differentiable  func¬ 
tions  whose  derivatives  satisfy  a  Lipschiz  condition,  then 

II  -  V/o(y)  -  Vf((y)  +  V/o(y')  +  V/t(y')||  <  Af  ||y  -  y'|| 
for  some  positive  number  M .  Therefore  assumption  A3 
holds,  with  fl(u)  =  u  and  g{y)  =  M . 

A4:  We  have  to  prove  that  for  each  c  >  0  and  for  each 

y 


In  this  appendix  we  prove  that,  under  certain  assump¬ 
tions,  a  class  of  stochastic  gradient  descent  techniques, 
that  include  “backpropagation”  as  a  particular  case,  con¬ 
verges  to  the  same  result  as  the  standard  gradient  de¬ 
scent  algorithm.  This  fact  is  stated  in  lemma  2.1  (see 
section  2),  which  is  an  application  of  the  following  theo¬ 
rem  [16]: 

Theorem  A.l  (Kushner  and  Clark,  1978)  Con¬ 
sider  the  following  sequence  in  RP: 


yn+l  =  yn  +  anh(y„,^„)  +  anftn 


(31) 


where  {(„}  is  a  sequence  of  random  variables  that  do  not 
depend  on  the  {y„}  and  ts  a  sequence  of  random 
variables  converging  to  zero.  Assume  the  following: 


lim  P{  sup  II T  0,1- V/o  -  V/{.  +  VF]\\  >  e)  =  0  . 

«  m>n 

*•  isn 

(32) 

Defining  the  random  variable 

s,  =  a.[VF-V/e.-V/o] 
condition  (32)  asserts  that  the  sequence 

m 

Cn  =  8Up  II  Vs, 11 

converges  in  probability.  It  is  sufficient  to  show  that  the 
series  s,-  converges  with  probability  one  (Kushner 
and  Clark,  page  32).  We  use  the  following  theorem  by 
Kolmogorov  and  Khinchin  (see  [27],  page  359): 


Theorem  A. 2  Lei  {;„}  be  a  sequence  of  independent 
random  variables  with  zero  mean.  Then,  if 

f;  E[zl]  <  oc 

n=  1 

the  senes  converges  with  probability  I. 

In  order  to  apply  this  theorem  to  the  random  variable 
s,.  we  first  need  to  check  that  s,  has  zero  mean.  The 
probability  distribution  of  the  random  variables  f  has 
been  assumed  to  be  uniform,  and  therefore  f*(4i)  =  jr 
VVe  then  have: 

E[s.]  =  a.E[VE- V/(.  -  V/o]  = 

=  a.[VE-E[V/t.]-V/o]  = 

n  =  l 

Similarly  we  have 

E[s?]  =  a?E[(VF  -  V/^.  -  Vfof]  =  aJC(y) 
for  some  function  C(y).  If  we  assume  that 

f:<<^ 

n  =  l 

then  Er=i  ^[«n]  =  C'('')E~=i«n  <  +00-  Therefore 
the  series  *'  converges  with  probability  1,  assump¬ 
tion  A4  holds,  and  lemma  2.1  is  proved. 

Note  that  the  aissumptions  about  the  boundedness  of 
the  sequence  {yn},  and  the  boundedness  of  the  gradi¬ 
ent  of  the  fi  are  not  as  restrictive  as  they  seem,  and  in 
practice  may  be  dropped.  In  fact,  we  can  restrict  the 
sequence  {y„}  to  a  compact  region  G  C  Rf ,  considering, 
instead  of  eq.  (31),  the  following  sequence: 

yn+i  =  nG[y„  -  a„(V/o(y„)  -b  V/c,(y„))  -f  a„T;„]  (33) 

where  lie  is  the  projection  onto  G.  In  this  case,  lemma 
2.1  now  holds  if  we  substitute  VF  in  eq.  (12)  with 
IIgVF  Choosing  G  sufficiently  large  to  contain  the 
fixed  points  of  interest,  the  conclusion  of  the  theorem 
is  practically  unchanged.  Moreover,  since  now  the  se¬ 
quence  is  restricted  to  the  compact  set  G,  the  derivatives 
of  the  fi  are  certainly  bounded  on  G,  being  continuous, 
and  therefore  the  assumpions  of  boundedness  of  the  gra¬ 
dient  of  the  fi  can  be  dropped. 
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FIGURE  CAPTIONS 

Figure  1.  Centers  and  neighborhood  structures  for  2- 
Dimensional  data  sets.  The  centers  (circled  dots  in  pan¬ 
els  (c)  and  (d))  for  the  data  sets  shown  in  panels  (a) 
and  (b),  respectively,  were  found  using  the  adaptive  al¬ 
gorithm  described  in  section  2.2.1.  The  neighborhood 
structure  is  represented  by  lines  that  connect  two  cen¬ 
ters  if  their  average  link  status  exceeds  0.1  (see  text). 

Figure  2.  Representative  centers  for  the  2-Dimensional 
data  set  of  panel  (a)  found  by  the  standaud  LKMA 
(panel(b))  and  by  the  soft  WTA  aJgorithm  (14) 
(panel(c)).  The  number  of  centers  (9)  was  kept  fixed 
in  both  cases. 

Figure  3.  Panel  (a)  portrays  a  2-class,  2-dimensional 
data  set  of  200  points;  points  of  class  1  are  represented  by 
dots,  and  points  of  class  2  by  signs  or  small  circles. 
The  circled  dots  of  panel  (b)  represent  the  centers  of  the 
boundary  set  (see  section  2.2.2).  the  upper  threshold 
B,  was  set  to  0.05  to  have  approximately  10  points  per 
center. 

Figure  4.  “Folded”  lattice  obtained  as  a  local  minimum 
of  energy  (20)  with  quadratic  potentials  and  cliques  of 
size  3.  (b)  Lattice  obtained  with  the  pyramid  process 
described  in  the  text.  The  data  points  are  uniformly 
distributed  in  the  unit  square. 

Figure  5.  “Pyramid”  process  for  adding  centers:  circles 
represent  the  nodes  of  the  initial  3x3  lattice;  “X”  signs 
correspond  to  the  centers  added  after  the  first  refine¬ 
ment. 

Figure  6.  Boundaries  from  sparse  data:  panel  (b)  por¬ 
trays  the  discrete  boundary  curve  (polygonal  line)  ob¬ 
tained  as  a  fixed  point  of  (23)  for  the  data  set  of  panel 
(a)  (dots  represent  points  of  class  1  and  plus  signs  or 
small  circles,  points  of  class  2)  with  A  =  0.01.  Psmel  (c) 
portrays  the  fixed  point  for  A  =  0.1. 

Figure  7.  Panels  (c)  and  (d)  portray  boundary  curves 
found  by  algorithm  (23)  for  the  data  sets  of  panels  (a) 
and  (b)  (dots  represent  points  of  class  1  and  plus  signs 
or  small  circles,  points  of  class  2). 

Figure  8.  Decision  boundaries  found  by  the  local  Gaus¬ 
sian  classifier  described  in  section  3.2  for  the  data  set  of 
figure  3. 

Figure  9.  Panel  (a)  portrays  a  real  scintigraphic  image 
of  a  human  heart  taken  from  a  left  anterior  oblique  pro¬ 
jection.  Panel  (b)  shows  the  superimposed  boundaries 
found  by  the  fuzzy  LGC  described  in  section  3.3. 
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